Lab 10: Distributions

Author

Jesse Brunner

Published

April 8, 2025

In this class so far we have seen or played with two distributions. First, we learned about likelihood from a binomial distribution, where counting outcomes was approachable. But then we abandoned it and started using the normal or Gaussian distribution, partly because it is familiar and easy to work with, but also because it does a surprisingly good job of describing data when we don’t have a good reason to choose something else. Well, today is the day to remind ourselves of those other options for when we do have reason not to use a Gaussian.

Our goals are to get reacquainted with several of these alternative distributions and to think about what our priors imply on the scale of each distribution’s parameters.

Binomial distribution (difficulty level: 1)

Whenever we have a certain number of trials, each of which has some probability of “success,” however we define it, we should probably use a binomial distribution1.

hist(rbinom(n=1e4, size=40, prob=0.12))

The two key parameters are the number of trials (\(N\) or size) and the probability of “success” (\(p\) or prob). The only tricky bit is that the probability must lie between zero and one (inclusive). That makes it a bit harder to use in, say, a regression model.

a <- 0.5
b <- 0.5
curve(a + b*x, from=-5, to=5)
abline(h=c(0,1), lty=2)

While it might be possible to carefully select parameters that don’t introduce impossible probabilities, it is hard in most real-world settings. Thus, we usually use some sort of link function that transforms the output of our linear model into an appropriately bounded value for our distribution. There are two common ones, depending on the discipline from which you come: the logit and the probit.
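A quick sketch of the two inverse links: plogis() is the inverse of the logit and pnorm() (the standard-normal CDF) is the inverse of the probit. Both squeeze the entire real line into the (0, 1) interval:

```r
# Both inverse links map any real-valued linear predictor into (0, 1)
eta <- seq(-5, 5, by = 0.1)   # output of some linear model
p_logit  <- plogis(eta)       # inverse logit: exp(eta) / (1 + exp(eta))
p_probit <- pnorm(eta)        # inverse probit: standard-normal CDF

plot(eta, p_logit, type = "l", ylab = "p", xlab = "linear predictor")
lines(eta, p_probit, lty = 2) # the probit rises a bit more steeply near zero
legend("topleft", legend = c("logit", "probit"), lty = 1:2)
```

Both curves pass through (0, 0.5); the main practical difference is in the tails, where the logit approaches 0 and 1 more slowly.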

Thinking of priors

Let’s pretend that we were modeling the probability of some event using a binomial distribution and wanted to use a logit link. As good researchers we should simulate some relationships, or even data, given our priors. Let’s also assume that we centered our data, so we expect our intercept, a, to be near zero. We also want to be cautious and not assume that slope, b, is positive or negative, but we think values closer to zero are more likely than very large or small values. So, following the habits we’ve developed, we assume priors for both that are normal and centered on zero. Let’s see what this looks like:

a <- rnorm(n=100, mean=0, sd=1) # pretty sure it will be close to zero
b <- rnorm(n=100, mean=0, sd=1) # less sure, but not too crazy

Let’s plot the relationships (\(\mu\) as a function of our predictor, x) that these priors imply:

plot(x=c(-5,5), y=c(-5,5), type="n", ylab="mu", xlab="predictor")
for(i in 1:length(a)){
  curve(a[i]+b[i]*x, add=TRUE)
}

That seems like what we expected. But wait, we need to think about this on the probability scale. Right, let’s repeat our plotting, but now with the plogis() function, the inverse of the logit link, to transform our lines onto the probability scale.

plot(x=c(-5,5), y=0:1, type="n", ylab="p", xlab="predictor")
for(i in 1:length(a)){
  curve( plogis(a[i]+b[i]*x) , add=TRUE)
}

It’s worth thinking about whether these are realistic. But let’s imagine we were encouraged to use even less informative priors on the slope. Say, normal with a standard deviation of 10. And no, this is not crazy. This is in fact what people I’ve worked with have been told to do.

b <- rnorm(n=100, mean=0, sd=10)
plot(x=c(-5,5), y=0:1, type="n", ylab="p", xlab="predictor")
for(i in 1:length(a)){
  curve( plogis(a[i]+b[i]*x) , add=TRUE)
}

Whoa! This seems to be implying that we are certain there is a very steep positive or a very steep negative relationship, but more subtle relationships are very rare. Somehow our “uncertainty” led to extremes being most likely.

This is what happens with link functions. You must consider your priors on the scale where they have their effect, as in, here, the probability of success they imply.

Your turn

In a typical univariate linear model, a one-unit increase in the predictor (\(\Delta x\)) leads to the same change in response, no matter where along the x-axis we start. That is, if we start at 0 and add 1 (\(\Delta x=1\)) we get the same \(\Delta y\) as we would if we started at 5 and added 1 (\(\Delta x=1\)).

I would like you to discern and plot how much a one unit change in our predictor, x, alters our response, prob, in a model with a logit or probit link.

  • Let’s assume a=0 and b=0.5.
  • Starting at x=-5, see how much prob changes if you add 1. Repeat across a range of x-values.
  • Graph these changes in prob against the x-values to which you added 1.

Poisson distribution (difficulty level: 1.25)

I gave this a higher difficulty level only because it is unfamiliar to most people. But really, it’s pretty easy to grok with a bit of time and exposure.

Whenever we have a count of a number of events or discrete things in a unit of time or space, a Poisson is a pretty good distribution to consider. Some examples include the number of:

  • tree seedlings per quadrat
  • visits to a food source per hour
  • human births in a given hospital per day
  • species of phytoplankton per pond

hist(rpois(n=100, lambda=0.5), col=NULL, breaks = 0:15, 
     xlab="Number of things")
hist(rpois(n=100, lambda=2.5), col=NULL, border="red", breaks = 0:15, add=T)
hist(rpois(n=100, lambda=5), col=NULL, border="blue", breaks = 0:15, add=T)

Notice that with the Poisson you do not specify the number of trials. Instead, we are interested in the rate at which events happen out of a theoretically infinite number that could happen. If you have a set number of events that could happen, you’re back in binomial territory.
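That connection can be made concrete: if the number of trials gets large while \(N \times p\) stays fixed, the binomial converges to a Poisson with \(\lambda = N \times p\). A quick sketch:

```r
# Binomial -> Poisson as N grows with N * p = lambda held fixed
lambda <- 3
k <- 0:10
p_pois <- dpois(k, lambda = lambda)
p_bin  <- dbinom(k, size = 1e4, prob = lambda / 1e4)  # N = 10,000 trials
max(abs(p_pois - p_bin))  # tiny: the two are nearly indistinguishable
```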

The Poisson is interesting in that it has only a single parameter, the average rate of events, or the mean, which is usually designated as \(\lambda\). What’s more, the variance is equal to the mean. Notice how the distributions become more spread out as the rate parameter increases in the figures above?
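We can check the mean-equals-variance property by simulation (the seed here is arbitrary, just for reproducibility):

```r
# For a Poisson, the variance equals the mean; verify by simulation
set.seed(101)  # arbitrary seed, just for reproducibility
x <- rpois(n = 1e5, lambda = 2.5)
mean(x)  # close to 2.5
var(x)   # also close to 2.5
```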

Using a Poisson likelihood to describe your data (or connect your expectation, \(\mu\), to your data) is straightforward; you just need to make sure that the value of \(\lambda\) you feed the Poisson is greater than zero. A linear model (which, again, can go negative) can describe this rate if you exponentiate it. Thus, \[ \lambda = \exp(a + b \times x) \] or, as we more commonly write it, \[ \log(\lambda) = a + b \times x \] As always, we should consider our priors on the scale where they have their effect, i.e., the rate at which events occur.

a <- rnorm(n=100, mean=0, sd=1) 
b <- rnorm(n=100, mean=0, sd=1) 

plot(x=c(-5,5), y=c(0,100), type="n", ylab=expression(lambda), xlab="predictor")
for(i in 1:length(a)){
  curve( exp(a[i]+b[i]*x) , add=TRUE)
}

Does that look right? Maybe we want to tame the more extreme slopes a bit…
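One option, sketched here: shrink the prior standard deviation on the slope (sd = 0.3 is an arbitrary choice for illustration) and replot on the \(\lambda\) scale.

```r
# A tighter prior on the slope tames the extreme rates (sketch)
set.seed(7)  # arbitrary seed, just for reproducibility
a <- rnorm(n = 100, mean = 0, sd = 1)
b <- rnorm(n = 100, mean = 0, sd = 0.3)  # much tighter than sd = 1 or 10

plot(x = c(-5, 5), y = c(0, 20), type = "n",
     ylab = expression(lambda), xlab = "predictor")
for (i in 1:length(a)) {
  curve(exp(a[i] + b[i] * x), add = TRUE)
}
```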

Gamma-Poisson / negative binomial (difficulty level: 3)

So what if we have count data, but we see a really long tail? Or what if we expect that most counts will be low or even zero, but some small fraction should have lots and lots? Let me offer a few examples.

  • Imagine that the seedlings you are counting come from seeds, but the seeds are clustered or clumped together. Larger clumps would lead to more seedlings in a quadrat than small clumps or individual seeds would. Or it could be seed hoarding by rodents. Or…
  • If you were counting the number of honey bee visits to flowers you might see that some flowers had a lot of visitors because early visitors taught their hive-mates where to find that flower.
  • Parasite burdens are often highly over-dispersed2. This might be because some parasites replicate in their host, so hosts that were infected earlier would tend to have more parasites in them. Or it could be because parasites “weaken” their hosts, making them more easily infected by more parasites. Or it could be differences in exposure; juvenile ticks are often found clustered in the environment, so a mouse that happens to run through such a “tick bomb” would have a huge burden compared to one that encountered only one or two ticks.

One way to think of this is that our counts reflect an essentially random process happening at some rate, \(\lambda\), but that the rate varies from plot to plot or flower to flower or host to host. If we are willing to assume that the rate is not a constant, but instead is a gamma distributed random variable—a not unreasonable assumption (see below)—then our counts reflect a gamma-Poisson process.

# generate a bunch of values of lambda from a gamma distribution
lambdas <- rgamma(n=1e4, shape = 2, rate=1/2)
hist(lambdas)

summary(lambdas)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0416  1.9177  3.3266  3.9957  5.3892 22.0406 
# generate random observations from a Poisson  
counts_gammapois <- rpois(n=1e4, lambda=lambdas) # with our varying lambdas
counts_pois <- rpois(n=1e4, lambda = mean(lambdas)) # or a constant value

hist(counts_gammapois, breaks = 0:max(counts_gammapois))
hist(counts_pois, breaks = 0:max(counts_gammapois),
     col=NULL, border = "red", add=TRUE)

summary(counts_gammapois)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   3.000   3.999   6.000  28.000 
summary(counts_pois)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   3.987   5.000  12.000 

Notice that the over-dispersed version (gray bars) with varying values of \(\lambda\) has more zeros and more high values than would be expected under the Poisson-only version (red) with a constant value of \(\lambda\). That should make some sense because rather than having a single rate of about 4, some observations come from a Poisson with a rate much lower or much higher than 4. This is the gamma-Poisson.
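The over-dispersion can be quantified with the variance-to-mean ratio. This sketch regenerates both count sets so it stands alone (the seed is arbitrary):

```r
# Variance-to-mean ratio: ~1 for a Poisson, > 1 when over-dispersed
set.seed(1)  # arbitrary seed
lambdas <- rgamma(n = 1e4, shape = 2, rate = 1/2)      # mean rate of 4
gp <- rpois(n = 1e4, lambda = lambdas)                 # gamma-Poisson counts
po <- rpois(n = 1e4, lambda = mean(lambdas))           # constant-rate counts

var(gp) / mean(gp)  # well above 1: over-dispersed
var(po) / mean(po)  # close to 1
```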

Another name for this is the negative binomial.

counts_nb <- rnbinom(n=1e4, mu = 4, size = 2)

hist(counts_gammapois, breaks = 0:max(counts_nb))
hist(counts_pois, breaks = 0:max(counts_nb),
     col=NULL, border = "red", add=TRUE)
hist(counts_nb, breaks = 0:max(counts_nb),
     col=NULL, border = "blue", add=TRUE)

See? We can recapitulate the gamma-Poisson distribution (gray bars) with the negative binomial (blue)…because they are the same distribution. If you do the math, you’ll see the relationship between the parameters of the gamma and those of the negative binomial: the over-dispersion parameter, size, often written as \(k\), equals the shape of the gamma, and the mean, mu, equals the gamma’s mean (shape \(\times\) scale). The larger the value of size, the more the results converge on the Poisson.
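A numeric check of that correspondence (a sketch; the seed is arbitrary): mixing a Poisson over gamma rates with shape = size and scale = mu/size should reproduce rnbinom(mu, size), whose variance is \(\mu + \mu^2/k\).

```r
# Gamma-Poisson vs. negative binomial: matching parameters, matching variance
set.seed(42)  # arbitrary seed
mu <- 4; size <- 2
x_nb <- rnbinom(n = 1e5, mu = mu, size = size)
x_gp <- rpois(n = 1e5,
              lambda = rgamma(n = 1e5, shape = size, scale = mu / size))

mu + mu^2 / size  # theoretical variance: 12
var(x_nb)         # close to 12
var(x_gp)         # also close to 12
```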

When I started looking at tick burdens on hosts and realized they were strongly over-dispersed, I got very excited about the possibility of inferring the process that led to the pattern in the data. A little reading about the negative binomial distribution, however, was a splash of cold water. There are a bazillion ways of arriving at the NB (See, for instance, the wikipedia page), from sequences of failures before \(x\) successes to balls in urns to some of the sorts of examples I’ve given. In short, it’s impossible to infer process from pattern alone. Again.

It’s also worth noting that there are different ways of parameterizing the negative binomial that are mathematically equivalent. I’m using the “ecological” version, but you’ll see it specified with probability of success (of the binomial trials in the sequence of failures before \(x\) successes) and size, as well. They’re the same, really, but I find the ecological version more comprehensible.

Again, be careful of your priors and their influences on the sorts of observations your golem expects to see. Here be dragons.

Gamma distribution (difficulty level: 2.5)

Gamma distributions are often used to describe waiting time or time to event data (e.g., time to death, time to mating). But they are also useful in that they describe continuous data that are \(\geq 0\) and are quite flexible. If the normal or log-normal distribution is not appropriate for your continuous, positive data, give the gamma a gander3.

curve(dgamma(x, shape=1, scale=1), from=0, to=5) # exponential!
curve(dgamma(x, shape=2, scale=1/3), col = "red", add=T)
curve(dgamma(x, shape=3, scale=1/3), col="blue", add=T)

Note that when shape = 1, we get the exponential distribution. (You can also get the chi-square distribution from the gamma, but since most of us aren’t especially familiar with the chi-square except as a point of comparison for our statistical tests, I won’t go into it.)
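A quick check of that special case: with shape = 1, the gamma density matches dexp() with rate = 1/scale exactly.

```r
# gamma(shape = 1, scale = s) is the exponential with rate = 1/s
x <- seq(0.1, 5, by = 0.1)
d_diff <- max(abs(dgamma(x, shape = 1, scale = 2) - dexp(x, rate = 1/2)))
d_diff  # effectively zero
```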

There are several ways of parameterizing this distribution, so be careful with parameters (usually rate = 1/scale and vice versa). No matter the parameterization the two parameters must be positive.

In this shape + scale version, the mean is shape\(\times\)scale (so the black and blue curves have the same mean of one, but the red one has a mean of 2/3). Since we usually use a model to describe the expectation or average value of our data, it would be nice to parameterize the gamma with a mean directly. We can do this by redefining the gamma function in R, like so, where mu is the mean value:

dgamma2 <- function(x, mu, scale, log=FALSE){
  dgamma(x, shape = mu/scale, scale = scale, log = log)
}

curve(dgamma2(x, mu=1, scale=1), from=0, to=5) # exponential!
curve(dgamma2(x, mu=2/3, scale=1/3), col = "red", add=T)
curve(dgamma2(x, mu=1, scale=1/3), col="blue", add=T)

Indeed, that is what McElreath created with his rethinking::dgamma2() function. Now you can model the mean, \(\mu\), as a (linear) function of your predictors and treat the scale parameter separately.
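A sanity check that this mean parameterization behaves as advertised: integrating \(x \times\) dgamma2(\(x \mid \mu\), scale) over its support should return \(\mu\). (The block redefines dgamma2() as above so it runs on its own.)

```r
# Verify numerically that the mean of dgamma2(..., mu, scale) is mu
dgamma2 <- function(x, mu, scale, log = FALSE) {
  dgamma(x, shape = mu / scale, scale = scale, log = log)
}
m <- integrate(function(x) x * dgamma2(x, mu = 2, scale = 0.5),
               lower = 0, upper = Inf)$value
m  # approximately 2, as it should be
```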

Since both mu and scale need to be positive, but probably not too large, an exponential distribution might be appropriate for priors on these two parameters.

mus <- rexp(n=1e2, rate=1)
scales <- rexp(n=1e2, rate=1)

plot(x=c(0,5), y=c(0, 4), type="n", 
     ylab="Probability density", 
     xlab="x")
for(i in 1:length(mus)){
  curve(dgamma2(x, mu=mus[i], scale=scales[i]), add=T)
}

We might also consider using normal distributions for the parameters and then exponentiating them, like so:

mus <- exp(rnorm(n=1e2, mean=0, sd=1))
scales <- exp(rnorm(n=1e2, mean=0, sd=1))

plot(x=c(0,5), y=c(0, 4), type="n",
     ylab="Probability density", 
     xlab="x")
for(i in 1:length(mus)){
  curve(dgamma2(x, mu=mus[i], scale=scales[i]), add=T)
}

Since the mean, \(\mu\), must be positive, we might link our linear model to this mean with a log link, as we did above: \[ \log(\mu) = a + b \times x \] Or really any link or scientific model that keeps \(\mu > 0\) will suffice.

Your turn

We’ve talked a bit about how unreasonably well the normal distribution describes data, even when there is no a priori reason to expect strict normality. Let’s test this idea.

  • Simulate data (with \(n=100\)) from a simple, linear model with one of these non-normal distributions (e.g., binomial, Poisson, gamma). You can choose whatever parameters you like, but make sure you know what they represent.
  • Fit the “right” model, the one from which you simulated your data.
  • Fit a model with the same structure, only using a normal distribution to describe your data.
  • Compare how well both models estimated the parameters of your linear model. Were they close or was the normal approximation biased? Any ideas of why?
  • Compare their expected predictive performance using PSIS or WAIC. How different were they? Does this make sense to you? Is it advisable?

Footnotes

  1. Recall that a Bernoulli distribution is a binomial where there is only one trial. So two for the price of one!↩︎

  2. Not to be confused with the over-dispersed from spatial statistics, which in essence means more regular than expected… why, oh, why does the same word mean opposite things?!↩︎

  3. Sorry, couldn’t help it!↩︎